factual question
ChiMDQA: Towards Comprehensive Chinese Document QA with Fine-grained Evaluation
Gao, Jing, Luo, Shutiao, Liu, Yumeng, Li, Yuanming, Zeng, Hongji
With the rapid advancement of natural language processing (NLP) technologies, the demand for high-quality Chinese document question-answering datasets is steadily growing. To address this need, we present the Chinese Multi-Document Question Answering Dataset (ChiMDQA), specifically designed for downstream business scenarios across prevalent domains including academia, education, finance, law, medical treatment, and news. ChiMDQA encompasses long-form documents from six distinct fields and consists of 6,068 rigorously curated, high-quality question-answer (QA) pairs, further classified into ten fine-grained categories. Through meticulous document screening and a systematic question-design methodology, the dataset guarantees both diversity and high quality, making it applicable to various NLP tasks such as document comprehension, knowledge extraction, and intelligent QA systems. Additionally, this paper offers a comprehensive overview of the dataset's design objectives, construction methodology, and fine-grained evaluation system, providing a solid foundation for future research and practical applications in Chinese QA. The code and data are available at: https://anonymous.4open.science/r/Foxit-CHiMDQA/.
- Health & Medicine (0.88)
- Education > Curriculum > Subject-Specific Education (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
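The ChiMDQA abstract above describes per-category scoring over 6,068 QA pairs. As a minimal sketch of what such fine-grained evaluation could look like, here is an aggregation loop over hypothetical records; the field names (`domain`, `category`, `question`, `answer`) and the exact-match metric are assumptions for illustration, not the dataset's actual schema or evaluation protocol.

```python
from collections import defaultdict

# Hypothetical ChiMDQA-style records; the real field names and
# category labels live in the dataset repo and may differ.
qa_pairs = [
    {"domain": "finance", "category": "extractive", "question": "...", "answer": "..."},
    {"domain": "law", "category": "multi-hop", "question": "...", "answer": "..."},
]

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate_by_category(pairs, predict):
    """Average a per-example metric within each fine-grained category."""
    scores = defaultdict(list)
    for pair in pairs:
        scores[pair["category"]].append(exact_match(predict(pair["question"]), pair["answer"]))
    return {cat: sum(vals) / len(vals) for cat, vals in scores.items()}

# `predict` would wrap whatever QA model is under evaluation.
print(evaluate_by_category(qa_pairs, predict=lambda q: "..."))
```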
The SMeL Test: A simple benchmark for media literacy in language models
Ahdritz, Gustaf, Kleiman, Anat
The internet is rife with unattributed, deliberately misleading, or otherwise untrustworthy content. Though large language models (LLMs) are often tasked with autonomous web browsing, the extent to which they have learned the simple heuristics human researchers use to navigate this noisy environment is not currently known. In this paper, we introduce the Synthetic Media Literacy Test (SMeL Test), a minimal benchmark that tests the ability of language models to actively filter out untrustworthy information in context. We benchmark a variety of commonly used instruction-tuned LLMs, including reasoning models, and find that no model consistently succeeds; while reasoning in particular is associated with higher scores, even the best API model we test hallucinates up to 70% of the time. Remarkably, larger and more capable models do not necessarily outperform their smaller counterparts. We hope our work sheds more light on this important form of hallucination and guides the development of new methods to combat it.
- North America > Canada (0.04)
- Asia > Singapore (0.04)
- Asia > Indonesia > Bali (0.04)
- (9 more...)
- Government (0.93)
- Media > News (0.68)
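The SMeL Test entry above evaluates whether a model filters untrustworthy in-context information. As an invented illustration of that setup (the benchmark's actual item format, sources, and scoring are defined in the paper, not here), one item could pit an attributed claim against an unattributed, conflicting one:

```python
# Illustrative SMeL-style item: two conflicting in-context claims,
# only one from an attributed source. All content below is invented.
item = {
    "trusted": ("Reuters", "The bridge opened in 1932."),
    "untrusted": ("anonymous blog comment", "The bridge opened in 1954."),
    "question": "When did the bridge open?",
    "answer": "1932",
}

def build_prompt(item):
    """Interleave both claims in context before asking the question."""
    (src_a, claim_a), (src_b, claim_b) = item["trusted"], item["untrusted"]
    return (
        f"Source ({src_a}): {claim_a}\n"
        f"Source ({src_b}): {claim_b}\n"
        f"Question: {item['question']}"
    )

def score(model_output: str, item) -> bool:
    """Credit the model only if it sides with the attributed source."""
    return item["answer"] in model_output

print(build_prompt(item))
```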
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
Ruis, Laura, Mozes, Maximilian, Bae, Juhan, Kamalakara, Siddhartha Rao, Talupuru, Dwarak, Locatelli, Acyr, Kirk, Robert, Rocktäschel, Tim, Grefenstette, Edward, Bartolo, Max
The capabilities and limitations of Large Language Models have been sketched out in great detail in recent years, providing an intriguing yet conflicting picture. On the one hand, LLMs demonstrate a general ability to solve problems. On the other hand, they show surprising reasoning gaps when compared to humans, casting doubt on the robustness of their generalisation strategies. The sheer volume of data used in the design of LLMs has precluded us from applying the method traditionally used to measure generalisation: train-test set separation. To overcome this, we study what kind of generalisation strategies LLMs employ when performing reasoning tasks by investigating the pretraining data they rely on. For two models of different sizes (7B and 35B) and 2.5B of their pretraining tokens, we identify what documents influence the model outputs for three simple mathematical reasoning tasks and contrast this to the data that are influential for answering factual questions. We find that, while the models rely on mostly distinct sets of data for each factual question, a document often has a similar influence across different reasoning questions within the same task, indicating the presence of procedural knowledge. We further find that the answers to factual questions often show up in the most influential data. However, for reasoning questions the answers usually do not show up as highly influential, nor do the answers to the intermediate reasoning steps. When we characterise the top ranked documents for the reasoning questions qualitatively, we confirm that the influential documents often contain procedural knowledge, like demonstrating how to obtain a solution using formulae or code. Our findings indicate that the approach to reasoning the models use is unlike retrieval, and more like a generalisable strategy that synthesises procedural knowledge from documents doing a similar form of reasoning.
- North America > United States (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- Asia > Nepal (0.04)
- (13 more...)
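The study above attributes model outputs to pretraining documents with EK-FAC influence functions. The sketch below keeps only the first-order core of that idea, scoring each document by the dot product of its loss gradient with the query's loss gradient, on a toy stand-in model; the inverse-Hessian approximation the paper actually applies is omitted.

```python
import torch

def flat_grad(loss, params):
    """Concatenate d(loss)/d(params) into one flat vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

model = torch.nn.Linear(16, 1)   # toy stand-in for a language model
params = list(model.parameters())
loss_fn = torch.nn.MSELoss()

def doc_loss(x, y):
    return loss_fn(model(x), y)

query = (torch.randn(1, 16), torch.randn(1, 1))            # the reasoning question
docs = [(torch.randn(1, 16), torch.randn(1, 1)) for _ in range(5)]  # pretraining docs

# First-order influence: gradient alignment between each document and the query.
g_query = flat_grad(doc_loss(*query), params)
influences = [torch.dot(flat_grad(doc_loss(*d), params), g_query).item() for d in docs]
print(sorted(range(len(docs)), key=lambda i: -influences[i]))  # most influential first
```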
Can AI chatbots be reined in by a legal duty to tell the truth?
Can artificial intelligence be made to tell the truth? Probably not, but the developers of large language model (LLM) chatbots should be legally required to reduce the risk of errors, says a team of ethicists. "What we're just trying to do is create an incentive structure to get the companies to put a greater emphasis on truth or accuracy when they are creating the systems," says Brent Mittelstadt at the University of Oxford. LLM chatbots, such as ChatGPT, generate human-like responses to users' questions based on statistical analysis of vast amounts of text. But although their answers usually appear convincing, they are also prone to errors, a flaw referred to as "hallucination".
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.25)
- Europe > Netherlands (0.05)
Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study
Zhang, Yichi, Huang, Yao, Sun, Yitong, Liu, Chang, Zhao, Zhe, Fang, Zhengwei, Wang, Yifan, Chen, Huanran, Yang, Xiao, Wei, Xingxing, Su, Hang, Dong, Yinpeng, Zhu, Jun
Despite the superior capabilities of Multimodal Large Language Models (MLLMs) across diverse tasks, they still face significant trustworthiness challenges. Yet, current literature on the assessment of trustworthy MLLMs remains limited, lacking a holistic evaluation to offer thorough insights into future improvements. In this work, we establish MultiTrust, the first comprehensive and unified benchmark on the trustworthiness of MLLMs across five primary aspects: truthfulness, safety, robustness, fairness, and privacy. Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts, encompassing 32 diverse tasks with self-curated datasets. Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks, highlighting the complexities introduced by the multimodality and underscoring the necessity for advanced methodologies to enhance their reliability. For instance, typical proprietary models still struggle with the perception of visually confusing images and are vulnerable to multimodal jailbreaking and adversarial attacks; MLLMs are more inclined to disclose privacy in text and reveal ideological and cultural biases even when paired with irrelevant images in inference, indicating that the multimodality amplifies the internal risks from base LLMs. Additionally, we release a scalable toolbox for standardized trustworthiness research, aiming to facilitate future advancements in this important field. Code and resources are publicly available at: https://multi-trust.github.io/.
- North America > United States > New York > New York County > New York City (0.14)
- Asia > China > Shanghai > Shanghai (0.04)
- Africa > Ethiopia (0.04)
- (7 more...)
- Overview (1.00)
- Research Report > New Finding (0.45)
- Research Report > Promising Solution (0.45)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Energy (1.00)
- (3 more...)
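MultiTrust, above, organizes 32 tasks under five aspects. A minimal sketch of rolling task results up to aspect scores is shown below; the task names, groupings, and numbers are invented placeholders, since the real benchmark defines its own 32 tasks and metrics.

```python
# Hypothetical aspect-to-task mapping and per-task accuracies for one MLLM.
ASPECTS = {
    "truthfulness": ["visual_confusion_qa"],
    "safety": ["multimodal_jailbreak"],
    "robustness": ["adversarial_image_qa"],
    "fairness": ["ideology_probe"],
    "privacy": ["text_privacy_leakage"],
}

task_scores = {
    "visual_confusion_qa": 0.41, "multimodal_jailbreak": 0.63,
    "adversarial_image_qa": 0.37, "ideology_probe": 0.72,
    "text_privacy_leakage": 0.55,
}

# Average the task scores within each of the five trustworthiness aspects.
aspect_scores = {
    aspect: sum(task_scores[t] for t in tasks) / len(tasks)
    for aspect, tasks in ASPECTS.items()
}
print(aspect_scores)
```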
Question Answering as Programming for Solving Time-Sensitive Questions
Zhu, Xinyu, Yang, Cheng, Chen, Bei, Li, Siheng, Lou, Jian-Guang, Yang, Yujiu
Question answering plays a pivotal role in human daily life because it involves our acquisition of knowledge about the world. However, due to the dynamic and ever-changing nature of real-world facts, the answer can be completely different when the time constraint in the question changes. Recently, Large Language Models (LLMs) have shown remarkable intelligence in question answering, while our experiments reveal that the aforementioned problems still pose a significant challenge to existing LLMs. This can be attributed to the LLMs' inability to perform rigorous reasoning based on surface-level text semantics. To overcome this limitation, rather than requiring LLMs to directly answer the question, we propose a novel approach where we reframe the Question Answering task as Programming (QAaP). Concretely, by leveraging modern LLMs' superior capability in understanding both natural language and programming language, we endeavor to harness LLMs to represent diversely expressed text as well-structured code and select the best matching answer from multiple candidates through programming. We evaluate our QAaP framework on several time-sensitive question answering datasets and achieve decent improvement, up to 14.5% over strong baselines. Our code and data are available at https://github.com/TianHongZXY/qaap.
- Asia > China > Liaoning Province > Dalian (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Oregon > Klamath County > Klamath Falls (0.04)
- (8 more...)
- Personal (0.93)
- Research Report > New Finding (0.46)
- Education (0.93)
- Government > Regional Government > North America Government > United States Government (0.93)
- Leisure & Entertainment > Sports > Soccer (0.68)
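The QAaP abstract above reframes time-sensitive QA as extracting structured representations and selecting the answer in code. Here is a minimal sketch of that idea; the schema, the names, and the dates are illustrative assumptions, not the paper's actual format, which lives in the linked repository.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    """A candidate answer with the span of time during which it held."""
    answer: str
    start: int  # year the fact became true
    end: int    # year it stopped being true

# In QAaP-style usage, the LLM would emit structures like these from the
# question and retrieved text; plain code then does the rigorous matching.
question_time = 2014  # parsed from, e.g., "Who was CEO in 2014?"
candidates = [
    Fact("Alice", 2010, 2013),
    Fact("Bob", 2013, 2017),
]

def best_match(candidates, t):
    """Return the candidate whose validity span contains time t."""
    matches = [c for c in candidates if c.start <= t < c.end]
    return matches[0].answer if matches else None

print(best_match(candidates, question_time))  # -> "Bob"
```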
Improving alignment of dialogue agents via targeted human judgements
Glaese, Amelia, McAleese, Nat, Trębacz, Maja, Aslanides, John, Firoiu, Vlad, Ewalds, Timo, Rauh, Maribeth, Weidinger, Laura, Chadwick, Martin, Thacker, Phoebe, Campbell-Gillingham, Lucy, Uesato, Jonathan, Huang, Po-Sen, Comanescu, Ramona, Yang, Fan, See, Abigail, Dathathri, Sumanth, Greig, Rory, Chen, Charlie, Fritz, Doug, Elias, Jaume Sanchez, Green, Richard, Mokrá, Soňa, Fernando, Nicholas, Wu, Boxi, Foley, Rachel, Young, Susannah, Gabriel, Iason, Isaac, William, Mellor, John, Hassabis, Demis, Kavukcuoglu, Koray, Hendricks, Lisa Anne, Irving, Geoffrey
We present Sparrow, an information-seeking dialogue agent trained to be more helpful, correct, and harmless compared to prompted language model baselines. We use reinforcement learning from human feedback to train our models with two new additions to help human raters judge agent behaviour. First, to make our agent more helpful and harmless, we break down the requirements for good dialogue into natural language rules the agent should follow, and ask raters about each rule separately. We demonstrate that this breakdown enables us to collect more targeted human judgements of agent behaviour and allows for more efficient rule-conditional reward models. Second, our agent provides evidence from sources supporting factual claims when collecting preference judgements over model statements. For factual questions, evidence provided by Sparrow supports the sampled response 78% of the time. Sparrow is preferred more often than baselines while being more resilient to adversarial probing by humans, violating our rules only 8% of the time when probed. Finally, we conduct extensive analyses showing that though our model learns to follow our rules it can exhibit distributional biases.
- Research Report > New Finding (1.00)
- Personal (1.00)
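The Sparrow abstract above breaks good dialogue into separate natural-language rules judged one at a time. A minimal sketch of that rule-conditional scoring interface follows; `rule_reward` is a hypothetical stand-in for a trained rule model, and the toy heuristic and min-aggregation are assumptions, not the paper's method.

```python
# Example rules in the spirit of Sparrow's rule breakdown (paraphrased).
RULES = [
    "Do not give medical advice.",
    "Do not pretend to have a human body.",
    "Support factual claims with evidence.",
]

def rule_reward(dialogue: str, rule: str) -> float:
    """Stand-in for a learned classifier P(rule followed | dialogue, rule)."""
    return 0.0 if "as a doctor" in dialogue.lower() else 1.0  # toy heuristic

def combined_score(dialogue: str) -> float:
    """Aggregate per-rule judgments, e.g. by taking the worst violation."""
    return min(rule_reward(dialogue, rule) for rule in RULES)

print(combined_score("Speaking as a doctor, you should take..."))  # -> 0.0
```

Querying one model per rule, rather than asking for a single holistic judgment, is what the abstract credits for more targeted human labels and more efficient reward modeling.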
A Study on the Manifestation of Trust in Speech
Gauder, Lara, Pepino, Leonardo, Riera, Pablo, Brussino, Silvina, Vidal, Jazmín, Gravano, Agustín, Ferrer, Luciana
Research has shown that trust is an essential aspect of human-computer interaction, directly determining the degree to which a person is willing to use a system. Automatically predicting the level of trust that a user has in a certain system could be used to attempt to correct potential distrust by having the system take relevant actions, such as apologizing or explaining its decisions. In this work, we explore the feasibility of automatically detecting the level of trust that a user has in a virtual assistant (VA) based on their speech. We developed a novel protocol for collecting speech data from subjects induced to have different degrees of trust in the skills of a VA. The protocol consists of an interactive session in which the subject is asked to respond to a series of factual questions with the help of the virtual assistant. To induce subjects to either trust or distrust the VA's skills, they are first informed that the VA was previously rated by other users as being either good or bad; subsequently, the VA answers the subjects' questions consistently with its alleged abilities. All interactions are speech-based, with subjects and VAs communicating verbally, which allows the recording of speech produced under different trust conditions. Using this protocol, we collected a speech corpus in Argentine Spanish. We show clear evidence that the protocol effectively succeeded in inducing in subjects the desired mental state of either trusting or distrusting the agent's skills, and we present the results of a perceptual study in which expert listeners rated the degree of trust. Finally, we found that the subjects' speech can be used to detect which type of VA they were interacting with, which could be considered a proxy for the users' trust in the VA's abilities, with an accuracy of up to 76%, compared to a random baseline of 50%.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- South America > Argentina > Pampas > Buenos Aires F.D. > Buenos Aires (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- (5 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
- Health & Medicine (0.46)
- Government (0.46)
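The detection task in the speech-trust study above is, at its core, binary classification of which VA condition a recording came from. A minimal sketch of that pipeline is below; the random features stand in for real utterance-level prosodic measurements (which is why this toy version scores near the 50% chance baseline rather than the paper's 76%).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder acoustic features: e.g., per-utterance pitch/energy means
# and variances would replace these random values in a real pipeline.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 8))
y = rng.integers(0, 2, size=n)  # 0 = "bad" VA condition, 1 = "good"

# Cross-validated accuracy of a simple linear classifier.
clf = LogisticRegression(max_iter=1000)
acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
print(f"cross-validated accuracy: {acc:.2f}")  # ~0.50 on random features
```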